欢迎关注“小丫画图”公众号,同名知识星球等你加入

小丫微信: epigenomics E-mail:

作者:热心肠研究院 高春辉https://github.com/gaospecial

小丫编辑校验

需求描述

用R从形式上复现原图。

出自https://academic.oup.com/jnci/article-lookup/doi/10.1093/jnci/djy156

Figure 1. Plot of all alterations detected by plasma next-generation sequencing (n=210). Size of circles represents number of patients identified with an alteration.

图的解析

例文用来展示基因上的 SNP/Indel/CNV 突变(多态性差异)。 例如,在TP53基因中发现了最多的多态性,包括 CNV 差异 SPLICE 8 个,单核苷酸位点差异 R273H 5个等。用点的大小表示差异,所以很容易发现常见的多态性差异。

稍作观察便可发现,本图是一个“圆环套圆环”的布局,中心处在中央,下一级的项目分别处在外环。图中只有二环,如果要扩展成五环,多显示几个层次应该也不错。

原文用的是在线工具

推测原文是用这个工具画的:FuncTreehttps://bioviz.tokyo/functree/,能画出类似的图,用来展示基因组数据的关系。

缺点:输入数据格式复杂非常复杂。需要针对每一个点做有针对性的设置。感兴趣的小伙伴去尝试一下吧~

如何用R实现

小丫发现tidytuesday2019-11-12的图跟原图很像,代码https://github.com/spren9er/tidytuesday/blob/master/tidytuesday_201946_cran_packages.r,输入数据https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-11-12/loc_cran_packages.csv

参考这套代码,用 ggraph 来画这个图。为此,花了几天时间仔细研究了ggraph包,写下了一篇长文:一文读懂 ggraph 的使用

应用场景

展示层级结构。例如:

注:

环境设置

加载R包

library(clusterProfiler)
## 
## clusterProfiler v3.16.1  For help: https://guangchuangyu.github.io/software/clusterProfiler
## 
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
## 
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:stats':
## 
##     filter
library(GOplot)
## Loading required package: ggplot2
## Loading required package: ggdendro
## Loading required package: gridExtra
## Loading required package: RColorBrewer
library(tidyverse)
## ─ Attaching packages ──────────────────── tidyverse 1.3.0 ─
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ✓ purrr   0.3.4
## ─ Conflicts ───────────────────── tidyverse_conflicts() ─
## x dplyr::combine()  masks gridExtra::combine()
## x dplyr::filter()   masks clusterProfiler::filter(), stats::filter()
## x dplyr::lag()      masks stats::lag()
## x purrr::simplify() masks clusterProfiler::simplify()
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(ggraph)
library(tidygraph)
## 
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
## 
##     filter
Sys.setenv(LANGUAGE = "en") #显示英文报错信息
options(stringsAsFactors = FALSE) #禁止chr转成factor

加载自定义函数

source(file = "gather_graph_node.R")
source(file = "gather_graph_edge.R")

输入文件的获得

如果你的数据已经整理成very_easy_input.csv的格式,就可以跳过这步,直接进入“输入文件预处理”。

先用clusterProfiler做KEGG的GSEA,然后用例图的形式展示结果。

gsym.fc <- read.table("easy_input_rnk.txt", header = T)
dim(gsym.fc)
head(gsym.fc)

#把gene symbol转换为ENTREZ ID
#此处物种是人,其他物种的ID转换方法,请参考FigureYa52GOplot
gsym.id <- bitr(gsym.fc$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")

#让基因名、ENTREZID、foldchange对应起来
gsym.fc.id <- merge(gsym.fc, gsym.id, by="SYMBOL", all=F)

#按照foldchange排序
gsym.fc.id.sorted <- gsym.fc.id[order(gsym.fc.id$logFC, decreasing = T),]

#获得ENTREZID、foldchange列表,做为GSEA的输入
id.fc <- gsym.fc.id.sorted$logFC
names(id.fc) <- gsym.fc.id.sorted$ENTREZID

#这一条语句就做完了KEGG的GSEA分析
kk <- gseKEGG(id.fc, organism = "hsa")
dim(kk)

#把ENTREZ ID转为gene symbol,便于查看通路里的基因
kk.gsym <- setReadable(kk, 'org.Hs.eg.db', #物种
                    'ENTREZID')

#可以用kk.gsym作为输入,用clusterProfiler画图
#用法看这里https://yulab-smu.github.io/clusterProfiler-book/chapter12.html
#gsym.fc.l <- gsym.fc$logFC
#names(gsym.fc.l) <- gsym.fc$SYMBOL
#cnetplot(sortkk, foldChange = gsym.fc.l, circular = TRUE)

#按照enrichment score从高到低排序,取前5(up)和后5(down)
#sortkk <- kk.gsym[order(kk.gsym@result$enrichmentScore, decreasing = T),][c(1:5, (nrow(kk.gsym)-5):nrow(kk.gsym)),]
#这里提取感兴趣的3个通路,数量太多拥挤的话不好看基因名
sortkk <- kk.gsym[kk.gsym@result$Description %like% "DNA" | 
                    kk.gsym@result$Description %like% "cycle" | 
                    kk.gsym@result$Description %like% "p53",]

#把富集分析结果整理为GOplot所需的格式
go <- data.frame(Category = "KEGG",
                 ID = sortkk$ID,
                 Term = sortkk$Description, 
                 Genes = gsub("/", ", ", sortkk$core_enrichment), 
                 adj_pval = sortkk$p.adjust)

#基因变化倍数
genelist <- data.frame(ID = gsym.fc.id$SYMBOL, logFC = gsym.fc.id$logFC)

#把富集分析和倍数整合在一起
circ <- circle_dat(go, genelist)
head(circ)
#可以用circ作为输入,用GOplot画图
#用法看这里https://wencke.github.io/
#GOCircle(circ)

#保存到文件
write.csv(circ[,c(3,5,6)],"very_easy_input.csv", quote = F, row.names = F)

输入文件预处理

very_easy_input.csv,这里以上面的富集分析结果为例,展示通路和通路里的基因变化倍数FC。三列依次是通路-基因-倍数,可以自由替换成“应用场景”里其他需要展示的信息。

gene_special.txt,要突出显示的基因。第一列是基因名,第二列是类型(例如基因家族信息)。

df <- read.csv("very_easy_input.csv")
head(df)
##              term genes    logFC
## 1 DNA replication  MCM5 2.010821
## 2 DNA replication  MCM2 1.956162
## 3 DNA replication  MCM6 1.439526
## 4 DNA replication  MCM4 1.422334
## 5 DNA replication  FEN1 1.385611
## 6 DNA replication  RFC4 1.266485
geneSpecial <- read.table("gene_special.txt", header = T)
geneCol <- geneSpecial$Type
names(geneCol) <- geneSpecial$Gene
geneCol
##    MCM6    MCM2     CD6    CDK4   CHEK2   CHEK1 
## "type1" "type1" "type2" "type2" "type3" "type3"

图由两个部分组成,节点(node)和边(edge)。

要从上面的数据框中采集节点和边的信息。

为此,我分别写了两个函数:gather_graph_node()gather_graph_edge() 来完成这一个任务(前面已加载)。

这两个函数的参数设置借鉴了 treemap() 的实现方式。

为了确保 node.name 的唯一性,在图中使用了长名,而把原有的名字放在 node.short_name 中去了。

node.level 则用来指示节点应该处于第几个圆环。

节点的属性统一以 node 作为前缀,而边的属性则以 edge 作为前缀。

nodes <- gather_graph_node(df, index = c("term", "genes"), value = "logFC", root="all")
## `summarise()` ungrouping output (override with `.groups` argument)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(dots)` instead of `dots` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## `summarise()` regrouping output by 'term' (override with `.groups` argument)
edges <- gather_graph_edge(df, index = c("term", "genes"), root = "all")
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(index)` instead of `index` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## `summarise()` ungrouping output (override with `.groups` argument)
nodes <- nodes %>% mutate_at(c("node.level","node.branch"),as.character)
head(nodes, 10)
##                node.name  node.size node.level node.count       node.short_name
## 1                    all 115.692338        all          1                   all
## 2             Cell cycle  70.226424       term         46            Cell cycle
## 3        DNA replication  21.413204       term         21       DNA replication
## 4  p53 signaling pathway  24.052710       term         16 p53 signaling pathway
## 5        Cell cycle/BUB1   1.185457      genes          1                  BUB1
## 6       Cell cycle/BUB1B   1.809404      genes          1                 BUB1B
## 7       Cell cycle/CCNA2   2.446934      genes          1                 CCNA2
## 8       Cell cycle/CCNB1   2.046811      genes          1                 CCNB1
## 9       Cell cycle/CCNB2   3.125239      genes          1                 CCNB2
## 10      Cell cycle/CCNE1   1.609160      genes          1                 CCNE1
##              node.branch
## 1                    all
## 2             Cell cycle
## 3        DNA replication
## 4  p53 signaling pathway
## 5             Cell cycle
## 6             Cell cycle
## 7             Cell cycle
## 8             Cell cycle
## 9             Cell cycle
## 10            Cell cycle
head(edges, 10)
## # A tibble: 10 x 2
##    from            to                   
##    <chr>           <chr>                
##  1 all             Cell cycle           
##  2 all             DNA replication      
##  3 all             p53 signaling pathway
##  4 DNA replication DNA replication/MCM5 
##  5 DNA replication DNA replication/MCM2 
##  6 DNA replication DNA replication/MCM6 
##  7 DNA replication DNA replication/MCM4 
##  8 DNA replication DNA replication/FEN1 
##  9 DNA replication DNA replication/RFC4 
## 10 DNA replication DNA replication/PCNA
# 把要突出显示的基因类型信息加到nodes里
nodes$color <- "normal"
nodes[nodes$node.short_name %in% geneSpecial$Gene,]$color <- geneCol[nodes[nodes$node.short_name %in% geneSpecial$Gene,]$node.short_name]
nodes[nodes$node.short_name %in% geneSpecial$Gene,]
##                      node.name node.size node.level node.count node.short_name
## 20             Cell cycle/CDK4 0.7449589      genes          1            CDK4
## 23            Cell cycle/CHEK1 2.2365719      genes          1           CHEK1
## 24            Cell cycle/CHEK2 0.6951341      genes          1           CHEK2
## 32             Cell cycle/MCM2 1.9561624      genes          1            MCM2
## 36             Cell cycle/MCM6 1.4395262      genes          1            MCM6
## 53        DNA replication/MCM2 1.9561624      genes          1            MCM2
## 57        DNA replication/MCM6 1.4395262      genes          1            MCM6
## 79  p53 signaling pathway/CDK4 0.7449589      genes          1            CDK4
## 81 p53 signaling pathway/CHEK1 2.2365719      genes          1           CHEK1
## 82 p53 signaling pathway/CHEK2 0.6951341      genes          1           CHEK2
##              node.branch color
## 20            Cell cycle type2
## 23            Cell cycle type3
## 24            Cell cycle type3
## 32            Cell cycle type1
## 36            Cell cycle type1
## 53       DNA replication type1
## 57       DNA replication type1
## 79 p53 signaling pathway type2
## 81 p53 signaling pathway type3
## 82 p53 signaling pathway type3
nodes$color <- factor(nodes$color, levels = unique(nodes$color))

# 有了节点和边的数据,使用 `tbl_graph()` 便可以得到一个图。
graph <- tbl_graph(nodes, edges)

开始画图

自定义配色,直接出图

# 用 `ggraph` 出图
gc <- ggraph(graph, layout = 'dendrogram', circular = TRUE) + 
  # 使用 filter 参数去掉 root(前面设置为"all")节点及与其相连的边
  geom_edge_diagonal(aes(color = node1.node.branch,
                         filter=node1.node.level!="all"), 
                     alpha = 1/3,edge_width=1) + 
  geom_node_point(aes(size = node.size, 
                      color = node.branch,
                      filter=node.level!="all"), alpha = 1/3) + 
  scale_size(range = c(0.5,80)) + #做均一化处理,让点的大小介于range之间
  theme(legend.position = "none") + #不画图例
  
  # 点和边的配色
  # 如果要改变边的配色,需要同时给边和点变色,否则会对应不上
  scale_edge_color_brewer(palette = "Set1") + #用?scale_color_brewer查看更多配色方案
  scale_color_brewer(palette = "Set1") +
  
  # 添加周围注释文字,此处是基因名gene
  geom_node_text(
    aes(
      x = 1.048 * x, #控制字跟点的距离
      y = 1.048 * y, #控制字跟点的距离
      label = node.short_name,
      angle = -((-node_angle(x, y) + 90) %% 180) + 90,
      filter = leaf,
      color = node.branch
      ),
    size = 6, hjust = 'outward') +
  
  # 添加内环文字,此处是通路名term
  geom_node_text(
    aes(label=node.short_name,
        filter = !leaf & (node.level != "all"),
        color = node.branch),
    fontface="bold",
    size=6,
    family="sans"
  ) + 
  theme(panel.background = element_rect(fill = NA)) +
  coord_cartesian(xlim=c(-1.3,1.3),ylim = c(-1.3,1.3)) #扩大坐标系

gc

ggsave("ccgraph_color.pdf", width = 14, height = 14)

在上面的图形中,线条的颜色由 geom_edge_diagonal(aes(color = node1.node.branch)) 指定。

node1.node.branch 指的是出发点(node1)的 node.branch 属性。如果要改变线条颜色,可以修改 nodes 表,添加一个属性(如 color) ,然后在 geom_edge_diagonal() 中将其映射到 color 上即可。

按例文配色,然后后期加背景色

gc1 <- ggraph(graph, layout = 'dendrogram', circular = TRUE) + 
  #画连线
  geom_edge_diagonal(aes(color = node2.color,
                         filter=node1.node.level!="all"), 
                     alpha = 0.5, #透明度
                     edge_width=2.5) + #连线的粗细
  scale_edge_color_manual(values = c("#61C3ED","red","purple","darkgreen")) + #自定义颜色

  #画点
  geom_node_point(aes(size = node.size,
                      filter=node.level!="all"), 
                  #alpha = 1/3,
                  color = "#61C3ED") + #统一为淡蓝色
  scale_size(range = c(0.5,80)) + #做均一化处理,让点的大小介于range之间
  theme(legend.position = "none") + #不画图例
  
  # 添加周围注释文字,此处是基因名gene
  geom_node_text(
    aes(
      x = 1.05 * x, #控制字跟点的距离
      y = 1.05 * y, #控制字跟点的距离
      label = node.short_name,
      angle = -((-node_angle(x, y) + 90) %% 180) + 90,
      filter = leaf
      ),
    color="black", #统一为黑色字
    size = 6, hjust = 'outward') +
  
  # 添加内环文字,此处是通路名term
  geom_node_text(
    aes(label=node.short_name,
        filter = !leaf & (node.level != "all")
        ),
    color="black", #统一为黑色字
    fontface="bold",
    size=6,
    family="sans"
  ) + 
  theme(panel.background = element_rect(fill = NA)) + #背景透明色
  coord_cartesian(xlim=c(-1.3,1.3),ylim = c(-1.3,1.3)) #扩大坐标系

gc1

后期处理

保存到pdf文件,是矢量图,可以用Illustrator等软件编辑图形、文字和背景

ggsave("ccgraph.pdf",width = 14,height = 14)

多层嵌套

这套代码不仅可以画两层的图,理论上支持更多层(要不然怎么叫“圆环套圆环”呢?)。

下面是一个例子,这里使用了常见的微生物组数据集(这是一个随机生成的 OTU 表)。

#随机生成一套数据
n <- 1000
microbiome <- data.frame(
  otu = paste("OTU",1:n,sep="_"),
  phylum = sample(paste("phylum",1:5,sep="_"),n,replace = T),
  class = sample(paste("class",6:30,sep="_"),n,replace=T),
  order = sample(paste("order",31:80,sep="_"),n,replace = T),
  value = runif(n,min=1,max=1000)
)
head(microbiome)

#保存到文件,便于小白套用格式
write.csv(microbiome, "microbiome.csv", quote = F, row.names = F)

加载输入数据

microbiome.csv,想画几层就给几+1列。这里前4列对应4层,最后一列是最底层节点对应的数值。

microbiome <- read.csv("microbiome.csv", header = T)
index_micro <- c("phylum","class","order") #除了最低层以外的列名
nodes_micro <- gather_graph_node(microbiome,index=index_micro,
                                  root = "bacteria") #root名字自己随便取
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` regrouping output by 'phylum' (override with `.groups` argument)
## `summarise()` regrouping output by 'phylum', 'class' (override with `.groups` argument)
edges_micro <- gather_graph_edge(microbiome,index=index_micro,root = "bacteria")
## `summarise()` ungrouping output (override with `.groups` argument)

画图

graph_micro <- tbl_graph(nodes_micro,edges_micro)
ggraph(graph_micro,layout = "dendrogram",circular=T) +
  geom_edge_diagonal(aes(color = node1.node.branch,filter=node1.node.level!="bacteria", alpha = node1.node.level),edge_width=1) + 
  geom_node_point(aes(size = node.size, color = node.branch,filter=node.level!="bacteria"), alpha = 1/3) + 
  scale_size(range = c(0.5,80)) + #做均一化处理,让点的大小介于range之间
  theme(legend.position = "none")+ #不画图例
  
  scale_edge_color_brewer(palette = "Set1") + #用?scale_color_brewer查看更多配色方案
  scale_color_brewer(palette = "Set1") +

  # 添加周围注释文字,此处是基因名gene
  geom_node_text(
    aes(
      x = 1.058 * x, #控制字跟点的距离
      y = 1.058 * y, #控制字跟点的距离
      label = node.short_name,
      angle = -((-node_angle(x, y) + 90) %% 180) + 90,
      filter = leaf,
      color = node.branch
      ),
    size = 1, hjust = 'outward') +
  
  # 添加内环文字,此处是通路名term
  geom_node_text(
    aes(label=node.short_name,
        filter = !leaf & (node.level == "phylum"),
        color = node.branch),
    fontface="bold",
    size=6,
    family="sans"
  ) + 
  theme(panel.background = element_rect(fill = NA)) +
  coord_cartesian(xlim=c(-1.3,1.3),ylim = c(-1.3,1.3)) #扩大坐标系

ggsave("ccgraph_microbiome.pdf", width = 14, height = 14)

参考资料

搞大你的点https://mp.weixin.qq.com/s/JNIncz3W-59yjGk2ibJWUw

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS  10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tidygraph_1.2.0        ggraph_2.0.4           data.table_1.13.4     
##  [4] forcats_0.5.0          stringr_1.4.0          dplyr_1.0.2           
##  [7] purrr_0.3.4            readr_1.4.0            tidyr_1.1.2           
## [10] tibble_3.0.4           tidyverse_1.3.0        GOplot_1.0.2          
## [13] RColorBrewer_1.1-2     gridExtra_2.3          ggdendro_0.1.22       
## [16] ggplot2_3.3.2          clusterProfiler_3.16.1
## 
## loaded via a namespace (and not attached):
##   [1] fgsea_1.14.0         colorspace_2.0-0     ellipsis_0.3.1      
##   [4] ggridges_0.5.2       qvalue_2.20.0        fs_1.5.0            
##   [7] rstudioapi_0.13      farver_2.0.3         urltools_1.7.3      
##  [10] graphlayouts_0.7.1   ggrepel_0.9.0        bit64_4.0.5         
##  [13] AnnotationDbi_1.50.3 fansi_0.4.1          scatterpie_0.1.5    
##  [16] lubridate_1.7.9.2    xml2_1.3.2           splines_4.0.2       
##  [19] GOSemSim_2.14.2      knitr_1.30           polyclip_1.10-0     
##  [22] jsonlite_1.7.2       broom_0.7.3          GO.db_3.11.4        
##  [25] dbplyr_2.0.0         ggforce_0.3.2        BiocManager_1.30.10 
##  [28] compiler_4.0.2       httr_1.4.2           rvcheck_0.1.8       
##  [31] backports_1.2.1      assertthat_0.2.1     Matrix_1.3-0        
##  [34] cli_2.2.0            tweenr_1.0.1         htmltools_0.5.0     
##  [37] prettyunits_1.1.1    tools_4.0.2          igraph_1.2.6        
##  [40] gtable_0.3.0         glue_1.4.2           reshape2_1.4.4      
##  [43] DO.db_2.9            fastmatch_1.1-0      Rcpp_1.0.5          
##  [46] enrichplot_1.8.1     Biobase_2.48.0       cellranger_1.1.0    
##  [49] vctrs_0.3.6          xfun_0.19            rvest_0.3.6         
##  [52] lifecycle_0.2.0      DOSE_3.14.0          europepmc_0.4       
##  [55] MASS_7.3-53          scales_1.1.1         hms_0.5.3           
##  [58] parallel_4.0.2       yaml_2.2.1           memoise_1.1.0       
##  [61] downloader_0.4       triebeard_0.3.0      stringi_1.5.3       
##  [64] RSQLite_2.2.1        S4Vectors_0.26.1     BiocGenerics_0.34.0 
##  [67] BiocParallel_1.22.0  rlang_0.4.9          pkgconfig_2.0.3     
##  [70] evaluate_0.14        lattice_0.20-41      labeling_0.4.2      
##  [73] cowplot_1.1.0        bit_4.0.4            tidyselect_1.1.0    
##  [76] plyr_1.8.6           magrittr_2.0.1       R6_2.5.0            
##  [79] IRanges_2.22.2       generics_0.1.0       DBI_1.1.0           
##  [82] pillar_1.4.7         haven_2.3.1          withr_2.3.0         
##  [85] modelr_0.1.8         crayon_1.3.4         utf8_1.1.4          
##  [88] rmarkdown_2.6        viridis_0.5.1        progress_1.2.2      
##  [91] grid_4.0.2           readxl_1.3.1         blob_1.2.1          
##  [94] reprex_0.3.0         digest_0.6.27        gridGraphics_0.5-1  
##  [97] stats4_4.0.2         munsell_0.5.0        viridisLite_0.3.0   
## [100] ggplotify_0.0.5